Skip to content

Wrap try_init_host futures in tokio::spawn so they don't get cancelled#2857

Merged
gefjon merged 1 commit into
masterfrom
phoebe/dont-drop-try-init
Jun 13, 2025
Merged

Wrap try_init_host futures in tokio::spawn so they don't get cancelled#2857
gefjon merged 1 commit into
masterfrom
phoebe/dont-drop-try-init

Conversation

@gefjon
Copy link
Copy Markdown
Contributor

@gefjon gefjon commented Jun 12, 2025

Description of Changes

Comment pasted from within watch_maybe_launch_module_host:

try_init_host is not cancel safe, as it will spawn other async tasks
which hold a filesystem lock past when try_init_host returns or is cancelled.
This means that, if try_init_host is cancelled, subsequent calls will fail.

This is problematic because Axum will cancel its handler tasks if the client disconnects,
and this method is called from Axum handlers, e.g. for the subscribe route.
tokio::spawn a task to build the Host and install it in the guard,
so that it will run to completion even if the caller goes away.

Note that tokio::spawn only cancels its tasks when the runtime shuts down,
at which point we won't be calling try_init_host again anyways.

This fixes (hopefully) the dreaded replay-loop bug, which we belive was caused by Axum dropping the watch_maybe_launch_module_host future after it acquired the Lockfile and leaked that file to a non-cancel-able async task, leading subsequent calls to fail due to the Lockfile still being held. Now, any call to watch_maybe_launch_module_host which enters try_init_host will run to completion and install the new Host into the HostController, even if the calling future is dropped or cancelled.

Expected complexity level and risk

1: wrapping already-async stuff in tokio::spawn is a minimal change.

Testing

  • Manual testing in a production-like cluster to see if this fixes the replay loop.

…elled

Comment pasted from within `watch_maybe_launch_module_host`:

> `try_init_host` is not cancel safe, as it will spawn other async tasks
> which hold a filesystem lock past when `try_init_host` returns or is cancelled.
> This means that, if `try_init_host` is cancelled, subsequent calls will fail.
>
> This is problematic because Axum will cancel its handler tasks if the client disconnects,
> and this method is called from Axum handlers, e.g. for the subscribe route.
> `tokio::spawn` a task to build the `Host` and install it in the `guard`,
> so that it will run to completion even if the caller goes away.
>
> Note that `tokio::spawn` only cancels its tasks when the runtime shuts down,
> at which point we won't be calling `try_init_host` again anyways.

This fixes (hopefully) the dreaded replay-loop bug,
which we belive was caused by Axum dropping the `watch_maybe_launch_module_host` future
after it acquired the `Lockfile` and leaked that file to a non-cancel-able async task,
leading subsequent calls to fail due to the `Lockfile` still being held.
Now, any call to `watch_maybe_launch_module_host` which enters `try_init_host`
will run to completion and install the new `Host` into the `HostController`,
even if the calling future is dropped or cancelled.
@gefjon gefjon requested a review from jsdt June 12, 2025 21:37
@gefjon
Copy link
Copy Markdown
Contributor Author

gefjon commented Jun 12, 2025

@Kasama reports that manual testing in a production-like deployment no longer showed the bug in question.

Copy link
Copy Markdown
Contributor

@jsdt jsdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks safe to do

Comment thread crates/core/src/host/host_controller.rs
@gefjon gefjon added this pull request to the merge queue Jun 13, 2025
Merged via the queue into master with commit 2d756db Jun 13, 2025
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants